feat(websearch): Add Exa search provider, support configurable Tavily/Exa API Base URLs, and expand the search tool docs #7359
piexian wants to merge 15 commits into AstrBotDevs:master from
Conversation
- Add the Exa search provider, with three tools:
  - web_search_exa: semantic search, supporting 5 search types and 6 verticals
  - exa_extract_web_page: extracts full page text via the /contents endpoint
  - exa_find_similar: finds semantically similar pages via the /findSimilar endpoint
- Make the Tavily and Exa API Base URLs configurable in the WebUI, for proxies and self-hosted instances
- Add a configurable timeout parameter (minimum 30s) to all web search tools
- Extend MessageList.vue reference parsing to support Exa/BoCha/findSimilar
- Update config metadata, i18n, routes, and hooks
- Update the ZH/EN user docs with tool parameter descriptions for Tavily/BoCha/Baidu AI Search
Hey - I've found 2 issues, and left some high level feedback:
- The minimum-timeout enforcement logic (`if timeout < 30: timeout = 30`) is duplicated across many tools (`fetch_url`, Tavily/BoCha/Exa helpers, etc.); consider extracting a small utility (or a module-level `MIN_TIMEOUT` constant plus helper) to centralize this behavior and avoid inconsistencies (e.g., `_web_search_exa` currently lacks the clamp).
- The Exa API key missing error message is in Chinese in `_get_exa_key` while other user-facing errors in this module are English; aligning these messages to a consistent language will make debugging and UX more coherent.
- The lists of supported web-search tools for reference extraction are now duplicated in multiple places (e.g., `astr_agent_hooks._extract_web_search_refs`, dashboard routes, and `MessageList.vue`), which makes it easy to miss a spot when adding new providers; consider centralizing this mapping or deriving it from a shared config to keep UI and backend behavior in sync — a sketch of one option follows this list.
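As a sketch of that centralization, assuming a shared module (called `web_search_utils` here, matching later commits in this thread) and illustrative tool names, both the hooks and the dashboard routes could import one constant, and the frontend could fetch it rather than hard-coding its own copy:

```python
# web_search_utils.py: single source of truth for tools whose results
# should be scanned for <ref> citations. The tool names are illustrative.
WEB_SEARCH_REFERENCE_TOOLS: frozenset[str] = frozenset(
    {
        "web_search_tavily",
        "web_search_bocha",
        "web_search_exa",
        "exa_find_similar",
    }
)


def is_reference_tool(tool_name: str) -> bool:
    """Return True if this tool's output may carry citation references."""
    return tool_name in WEB_SEARCH_REFERENCE_TOOLS
```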
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The minimum-timeout enforcement logic (`if timeout < 30: timeout = 30`) is duplicated across many tools (`fetch_url`, Tavily/BoCha/Exa helpers, etc.); consider extracting a small utility (or a module-level `MIN_TIMEOUT` constant plus helper) to centralize this behavior and avoid inconsistencies (e.g., `_web_search_exa` currently lacks the clamp).
- The Exa API key missing error message is in Chinese in `_get_exa_key` while other user-facing errors in this module are English; aligning these messages to a consistent language will make debugging and UX more coherent.
- The lists of supported web-search tools for reference extraction are now duplicated in multiple places (e.g., `astr_agent_hooks._extract_web_search_refs`, dashboard routes, and `MessageList.vue`), which makes it easy to miss a spot when adding new providers; consider centralizing this mapping or deriving it from a shared config to keep UI and backend behavior in sync.
## Individual Comments
### Comment 1
<location path="astrbot/builtin_stars/web_searcher/main.py" line_range="159-168" />
<code_context>
self,
cfg: AstrBotConfig,
payload: dict,
+ timeout: int = 30,
) -> list[SearchResult]:
"""使用 Tavily 搜索引擎进行搜索"""
</code_context>
<issue_to_address>
**suggestion:** Normalize the timeout value inside `_web_search_exa` for consistency and safety.
Other helpers (`_get_from_url`, `_web_search_tavily`, `_extract_tavily`, `_web_search_bocha`, `_extract_exa`, `_find_similar_exa`) all enforce a minimum 30s timeout internally, while `_web_search_exa` relies on its caller (`search_from_exa`) to clamp the value. If `_web_search_exa` is reused elsewhere, it may see much smaller timeouts and behave inconsistently. Please add the same `if timeout < 30: timeout = 30` guard at the top of `_web_search_exa` to align behavior and avoid unexpectedly short timeouts.
Suggested implementation:
```python
) -> list[SearchResult]:
    """使用 Exa 搜索引擎进行搜索"""
    if timeout < 30:
        timeout = 30
```
If the `_web_search_exa` signature or docstring differ slightly (e.g., different Chinese text or no docstring), adjust the SEARCH pattern to match the actual function header and insert:
```python
if timeout < 30:
    timeout = 30
```
as the first statement in the function body, immediately after any docstring, to keep behavior consistent with `_get_from_url`, `_web_search_tavily`, `_extract_tavily`, `_web_search_bocha`, `_extract_exa`, and `_find_similar_exa`.
</issue_to_address>
### Comment 2
<location path="astrbot/builtin_stars/web_searcher/main.py" line_range="68" />
<code_context>
"""清理文本,去除空格、换行符等"""
return text.strip().replace("\n", " ").replace("\r", " ").replace(" ", " ")
- async def _get_from_url(self, url: str) -> str:
+ async def _get_from_url(self, url: str, timeout: int = 30) -> str:
"""获取网页内容"""
</code_context>
<issue_to_address>
**issue (complexity):** Consider extracting shared helpers for timeout handling, base-URL construction, and Exa HTTP requests to remove duplication and make the web search code easier to maintain.
You can keep all the new functionality while cutting a lot of duplication with a few small helpers. The main hot spots are timeout handling, base URL construction, and Exa HTTP calls.
### 1. Centralize timeout normalization
The `if timeout < 30: timeout = 30` pattern is repeated many times.
Add a helper:
```python
def _normalize_timeout(self, timeout: int | None, minimum: int = 30) -> aiohttp.ClientTimeout:
    if timeout is None:
        timeout = minimum
    elif timeout < minimum:
        timeout = minimum
    return aiohttp.ClientTimeout(total=timeout)
```
Then use it at call sites instead of repeating the logic:
```python
async def _web_search_tavily(self, cfg: AstrBotConfig, payload: dict, timeout: int = 30) -> list[SearchResult]:
    tavily_key = await self._get_tavily_key(cfg)
    base_url = self._tavily_base_url(cfg)
    url = f"{base_url}/search"
    header = {
        "Authorization": f"Bearer {tavily_key}",
        "Content-Type": "application/json",
    }
    timeout_obj = self._normalize_timeout(timeout)
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(url, json=payload, headers=header, timeout=timeout_obj) as response:
            ...
```
And for tools you can drop the inline clamp:
```python
@llm_tool(name="fetch_url")
async def fetch_website_content(self, event: AstrMessageEvent, url: str, timeout: int = 30) -> str:
timeout_obj = self._normalize_timeout(timeout)
resp = await self._get_from_url(url, timeout_obj.total)
return resp
```
(or just pass `timeout_obj` through if you adjust `_get_from_url`).
### 2. Extract base URL helpers for providers
The Tavily and Exa base URL logic is repeated.
Add:
```python
def _tavily_base_url(self, cfg: AstrBotConfig) -> str:
    return (
        cfg.get("provider_settings", {})
        .get("websearch_tavily_base_url", "https://api.tavily.com")
        .rstrip("/")
    )


def _exa_base_url(self, cfg: AstrBotConfig) -> str:
    return (
        cfg.get("provider_settings", {})
        .get("websearch_exa_base_url", "https://api.exa.ai")
        .rstrip("/")
    )
```
Then simplify call sites:
```python
base_url = self._tavily_base_url(cfg)
url = f"{base_url}/search"
```
```python
base_url = self._exa_base_url(cfg)
url = f"{base_url}/contents"
```
This removes duplication and keeps provider-specific config in one place.
### 3. Consolidate Exa HTTP request logic
`_web_search_exa`, `_extract_exa`, and `_find_similar_exa` all repeat the same HTTP boilerplate. You can pull that out into one internal helper that deals with key retrieval, base URL, headers, timeout, and error handling:
```python
async def _exa_request(
    self,
    cfg: AstrBotConfig,
    path: str,
    payload: dict,
    timeout: int = 30,
) -> dict:
    exa_key = await self._get_exa_key(cfg)
    base_url = self._exa_base_url(cfg)
    url = f"{base_url}/{path.lstrip('/')}"
    header = {
        "x-api-key": exa_key,
        "Content-Type": "application/json",
    }
    timeout_obj = self._normalize_timeout(timeout)
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(url, json=payload, headers=header, timeout=timeout_obj) as response:
            if response.status != 200:
                reason = await response.text()
                raise Exception(
                    f"Exa request to {path} failed: {reason}, status: {response.status}",
                )
            return await response.json()
```
Then each high-level method only shapes payload and maps results:
```python
async def _web_search_exa(
    self,
    cfg: AstrBotConfig,
    payload: dict,
    timeout: int = 30,
) -> list[SearchResult]:
    data = await self._exa_request(cfg, "search", payload, timeout=timeout)
    results: list[SearchResult] = []
    for item in data.get("results", []):
        results.append(
            SearchResult(
                title=item.get("title", ""),
                url=item.get("url", ""),
                snippet=(item.get("text") or "")[:500],
            )
        )
    return results
```
```python
async def _extract_exa(
    self, cfg: AstrBotConfig, payload: dict, timeout: int = 30
) -> list[dict]:
    data = await self._exa_request(cfg, "contents", payload, timeout=timeout)
    results: list[dict] = data.get("results", [])
    if not results:
        raise ValueError("Error: Exa content extraction does not return any results.")
    return results
```
```python
async def _find_similar_exa(
    self, cfg: AstrBotConfig, payload: dict, timeout: int = 30
) -> list[SearchResult]:
    data = await self._exa_request(cfg, "findSimilar", payload, timeout=timeout)
    results: list[SearchResult] = []
    for item in data.get("results", []):
        results.append(
            SearchResult(
                title=item.get("title", ""),
                url=item.get("url", ""),
                snippet=(item.get("text") or "")[:500],
            )
        )
    return results
```
This way, if you change headers, auth, or error handling, you only touch `_exa_request`.
### 4. Optional: small helpers for repeated validation
If you want to further simplify the public tool methods, a couple of tiny validators can keep them focused on behavior rather than plumbing.
For example, Exa config check and clamping:
```python
def _ensure_exa_config(self, cfg: AstrBotConfig) -> None:
    if not cfg.get("provider_settings", {}).get("websearch_exa_key", []):
        raise ValueError("Error: Exa API key is not configured in AstrBot.")


def _clamp_results(self, value: int, minimum: int, maximum: int) -> int:
    return max(minimum, min(value, maximum))
```
Usage:
```python
@llm_tool("web_search_exa")
async def search_from_exa(..., max_results: int = 10, ...):
...
cfg = self.context.get_config(umo=event.unified_msg_origin)
self._ensure_exa_config(cfg)
max_results = self._clamp_results(max_results, 1, 100)
...
```
These changes keep all functionality (timeouts, base URLs, Exa/Tavily features) but reduce repetition and make future changes safer and easier.
</issue_to_address>
Code Review
This pull request introduces the Exa search provider, adding tools for semantic search, web page extraction, and finding similar links. It also adds support for configurable base URLs for Tavily and Exa, and implements a minimum 30-second timeout across various search and extraction tools. Feedback includes addressing a potential IndexError during Exa API key rotation, reusing aiohttp sessions for efficiency, improving error handling when no extraction results are found, and refactoring tool management logic to reduce duplication.
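On the key-rotation point, here is a hypothetical sketch of a rotator that fails clearly on an empty key list instead of raising an `IndexError`; the PR's actual rotator (referenced later as `_EXA_KEY_ROTATOR`) is not quoted in this thread and may differ:

```python
import itertools


class ApiKeyRotator:
    """Round-robin over configured API keys, guarding the empty case."""

    def __init__(self, setting_name: str) -> None:
        self._setting_name = setting_name
        self._keys: tuple[str, ...] = ()
        self._cycle = None

    def get(self, provider_settings: dict) -> str:
        keys = tuple(provider_settings.get(self._setting_name) or ())
        if not keys:
            # Fail with a clear message instead of an IndexError on keys[i].
            raise ValueError(
                f"Error: no API key configured for '{self._setting_name}'."
            )
        if keys != self._keys:
            # The configured key list changed; rebuild the rotation cycle.
            self._keys = keys
            self._cycle = itertools.cycle(keys)
        return next(self._cycle)
```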
Pull request overview
This PR extends AstrBot’s web search capabilities by adding an Exa provider (semantic search + extraction + similar-page discovery), making Tavily/Exa API base URLs configurable for proxy/self-hosted endpoints, and updating the dashboard and docs to reflect the expanded toolchain and citation parsing.
Changes:
- Add Exa as a new `websearch_provider`, including `web_search_exa`, `exa_extract_web_page`, and `exa_find_similar` LLM tools.
- Make Tavily/Exa API Base URL configurable and thread it through web search + URL extraction/KB upload flows.
- Update WebUI citation/ref parsing and expand websearch documentation (ZH/EN) plus config metadata i18n.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| docs/zh/use/websearch.md | Expands ZH docs for default/Tavily/Exa/BoCha/Baidu tool parameters and configuration. |
| docs/en/use/websearch.md | Expands EN docs for provider options, tool parameters, and Base URL configuration. |
| dashboard/src/i18n/locales/zh-CN/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/i18n/locales/en-US/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/i18n/locales/ru-RU/features/config-metadata.json | Adds metadata strings for Tavily/Exa Base URL + Exa key. |
| dashboard/src/components/chat/MessageList.vue | Updates supported tool-call parsing to recognize Exa + findSimilar results for refs. |
| astrbot/dashboard/routes/live_chat.py | Extends supported tool list for extracting <ref> citations (Exa + findSimilar). |
| astrbot/dashboard/routes/chat.py | Extends supported tool list for extracting <ref> citations (Exa + findSimilar). |
| astrbot/core/knowledge_base/parsers/url_parser.py | Adds Tavily Base URL support to the KB URL extractor wrapper. |
| astrbot/core/knowledge_base/kb_helper.py | Plumbs Tavily Base URL into KB “upload from URL” extraction. |
| astrbot/core/config/default.py | Adds new provider settings defaults + metadata for Exa and Base URLs. |
| astrbot/core/astr_agent_hooks.py | Extends webchat citation-injection logic to Exa + findSimilar tools. |
| astrbot/builtin_stars/web_searcher/main.py | Implements Exa tools, adds configurable Base URLs, and adds per-tool optional timeout support. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
- Refactor web_search_utils.py into a layered structure; add build_web_search_refs() and _extract_ref_indices() to extract citation indices from <ref> tags
- Simplify ref extraction in chat.py/live_chat.py to a call to build_web_search_refs() (usage sketch below)
- Add getMessageRefs() to MessageList.vue so the frontend falls back to extracting refs itself when the backend returns none
- Fix the message-save condition check in chat.py
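The consolidated helper is shown in full later in this thread; a hypothetical call site in `chat.py` (variable and payload names assumed for illustration) reduces to roughly:

```python
# After streaming finishes, derive citation refs once, in one place.
refs = build_web_search_refs(
    accumulated_text,   # full assistant text, which may contain <ref>...</ref> tags
    accumulated_parts,  # streamed parts, including tool_call results
)
if refs:
    message_payload["web_search_refs"] = refs["used"]  # message_payload is hypothetical
```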
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 370167fb39
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 96e15f79ad
```python
ret_ls = []
for result in results:
    ret_ls.append(f"URL: {result.get('url', 'No URL')}")
    text = await self._tidy_text(result.get("text", "No content"))
```
Handle null text before tidying Exa extraction output
`exa_extract_web_page` passes `result.get("text", "No content")` directly into `_tidy_text`, but Exa responses can include `"text": null` (or other non-string values) for pages it cannot extract. In that case `_tidy_text` calls `.strip()` on a non-string and raises, causing the whole tool call to fail instead of returning partial results. Please coerce `text` to a safe string before calling `_tidy_text`.
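A minimal fix along the lines Codex suggests, coercing to a string before tidying (the names follow the quoted snippet above):

```python
# Exa may return "text": null for pages it cannot extract; coerce to a
# safe string so _tidy_text never calls .strip() on a non-string.
raw_text = result.get("text") or "No content"
text = await self._tidy_text(str(raw_text))
```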
@sourcery-ai review
Hey - I've found 2 issues, and left some high level feedback:
- `WEB_SEARCH_REFERENCE_TOOLS` is now defined separately in Python (`web_search_utils.py`) and in the frontend (`MessageList.vue`); consider centralizing this list or generating the frontend list from the backend to avoid drift when adding/removing tools.
- `_normalize_timeout` unconditionally casts `timeout` to `int`, so values like `30.9` or large numeric strings will be truncated rather than validated; if you expect non-integer or very large values, you might want stricter range checking and clearer error handling instead of silently coercing — a sketch follows this list.
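A stricter variant along the lines of that second point; this is a sketch only, with illustrative bounds, assuming callers prefer explicit validation over silent truncation:

```python
import aiohttp

MIN_WEB_SEARCH_TIMEOUT = 30
MAX_WEB_SEARCH_TIMEOUT = 600  # illustrative upper bound


def _normalize_timeout(timeout: object) -> aiohttp.ClientTimeout:
    """Validate a timeout value, clamping into [MIN, MAX] and rejecting non-numbers."""
    if timeout is None:
        return aiohttp.ClientTimeout(total=MIN_WEB_SEARCH_TIMEOUT)
    try:
        seconds = float(timeout)  # keeps 30.9 as 30.9 instead of truncating it
    except (TypeError, ValueError):
        raise ValueError(f"timeout must be a number, got {timeout!r}") from None
    clamped = max(MIN_WEB_SEARCH_TIMEOUT, min(seconds, MAX_WEB_SEARCH_TIMEOUT))
    return aiohttp.ClientTimeout(total=clamped)
```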
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- WEB_SEARCH_REFERENCE_TOOLS is now defined separately in Python (web_search_utils.py) and in the frontend (MessageList.vue); consider centralizing this list or generating the frontend list from the backend to avoid drift when adding/removing tools.
- _normalize_timeout unconditionally casts timeout to int, so values like 30.9 or large numeric strings will be truncated rather than validated; if you expect non-integer or very large values, you might want stricter range checking and clearer error handling instead of silently coercing.
## Individual Comments
### Comment 1
<location path="astrbot/core/tools/web_search_tools.py" line_range="648-657" />
<code_context>
+ max_results = max(1, min(int(kwargs.get("max_results", 10)), 100))
</code_context>
<issue_to_address>
**issue (bug_risk):** Unvalidated int() casts for max_results can raise on malformed input and abort tool execution.
Both `ExaWebSearchTool` and `ExaFindSimilarTool` call `int(kwargs.get("max_results", 10))` directly, so a non-integer value (e.g. "10.5", "foo", or an unexpected type) will raise `ValueError` and abort the tool call. To match how other user-supplied numerics are handled, consider wrapping this in a helper or local try/except that falls back to 10 and then applies the 1–100 clamp, so invalid input degrades gracefully instead of throwing.
</issue_to_address>
### Comment 2
<location path="tests/unit/test_web_search_utils.py" line_range="96-105" />
<code_context>
+ assert [ref["index"] for ref in refs["used"]] == ["a152.1", "a152.2"]
+
+
+@pytest.mark.parametrize(
+ ("base_url", "expected_message"),
+ [
+ (
+ "exa.ai/search",
+ "Error: Exa API Base URL must start with http:// or https://. "
+ "Proxy base paths are allowed. Received: 'exa.ai/search'.",
+ ),
+ ],
+)
+def test_normalize_web_search_base_url_reports_invalid_value(
+ base_url: str, expected_message: str
+):
+ with pytest.raises(ValueError) as exc_info:
+ normalize_web_search_base_url(
+ base_url,
+ default="https://api.exa.ai",
+ provider_name="Exa",
+ )
+
</code_context>
<issue_to_address>
**suggestion (testing):** Extend normalize_web_search_base_url tests to cover empty/None values and multiple invalid schemes.
Key branches still aren’t covered:
- `base_url` is `None` or an empty string → should return the trimmed `default` value. Please add a parametrized test for this.
- Additional invalid values like `"ftp://exa.ai"`, `"https:///only-path"`, or `"http://"` (no netloc) → should raise `ValueError` with the same message format.
These cases will complete coverage for scheme/netloc validation and default handling.
</issue_to_address>
```python
@pytest.mark.parametrize(
    ("base_url", "expected_message"),
    [
        (
            "exa.ai/search",
            "Error: Exa API Base URL must start with http:// or https://. "
            "Proxy base paths are allowed. Received: 'exa.ai/search'.",
        ),
    ],
)
```
suggestion (testing): Extend normalize_web_search_base_url tests to cover empty/None values and multiple invalid schemes.
Key branches still aren't covered:
- `base_url` is `None` or an empty string → should return the trimmed `default` value. Please add a parametrized test for this.
- Additional invalid values like `"ftp://exa.ai"`, `"https:///only-path"`, or `"http://"` (no netloc) → should raise `ValueError` with the same message format.
These cases will complete coverage for scheme/netloc validation and default handling.
```python
if search_type not in ("auto", "neural", "fast", "instant", "deep"):
    search_type = "auto"

max_results = max(1, min(int(kwargs.get("max_results", 10)), 100))
```
`max_results` is parsed with `int(kwargs.get("max_results", 10))` without error handling. Tool arguments come from the model and may be strings/invalid; a `ValueError` here will bubble up and abort the tool execution. Handle non-numeric values gracefully (e.g., try/except with a default, or validate and return a clear error).
Suggested change:
```python
raw_max_results = kwargs.get("max_results", 10)
try:
    parsed_max_results = int(raw_max_results)
except (TypeError, ValueError):
    parsed_max_results = 10
max_results = max(1, min(parsed_max_results, 100))
```
```python
results = await _exa_find_similar(
    provider_settings,
    {
        "url": url,
        "numResults": max(1, min(int(kwargs.get("max_results", 10)), 100)),
```
`numResults` uses `int(kwargs.get("max_results", 10))` without guarding against non-numeric input. Since tool args are model-provided, invalid types can raise `ValueError` and crash the tool call. Wrap the conversion in try/except (defaulting/clamping), or validate and return a tool error message.
Suggested change:
```python
try:
    max_results = int(kwargs.get("max_results", 10))
except (TypeError, ValueError):
    return "Error: max_results must be an integer."
max_results = max(1, min(max_results, 100))
results = await _exa_find_similar(
    provider_settings,
    {
        "url": url,
        "numResults": max_results,
```
```python
return normalize_web_search_base_url(
    provider_settings.get("websearch_tavily_base_url"),
    default="https://api.tavily.com",
    provider_name="Tavily",
)
```
`_get_tavily_base_url()` can raise `ValueError` via `normalize_web_search_base_url()`. That exception will propagate out of `_tavily_search()` and can abort the tool execution (and potentially the whole agent run) instead of returning a normal tool error result. Consider catching `ValueError` at the call site (or inside `_get_tavily_base_url`/`_tavily_search`) and returning a user-facing error string (or falling back to the default base URL with a warning).
Suggested change:
```python
default_base_url = "https://api.tavily.com"
try:
    return normalize_web_search_base_url(
        provider_settings.get("websearch_tavily_base_url"),
        default=default_base_url,
        provider_name="Tavily",
    )
except ValueError as exc:
    logger.warning(
        "Invalid Tavily base URL configuration %r; falling back to %s. Error: %s",
        provider_settings.get("websearch_tavily_base_url"),
        default_base_url,
        exc,
    )
    return default_base_url
```
```python
return normalize_web_search_base_url(
    provider_settings.get("websearch_exa_base_url"),
    default="https://api.exa.ai",
    provider_name="Exa",
)
```
`_get_exa_base_url()` can raise `ValueError` via `normalize_web_search_base_url()`. Since callers (`_exa_search`/`_exa_extract`/`_exa_find_similar`) don't handle it, an invalid configured base URL can raise out of the tool call and abort execution. Catch `ValueError` and surface a stable tool error message (or fall back to the default base URL) so misconfiguration doesn't crash the run.
Suggested change:
```python
default_base_url = "https://api.exa.ai"
try:
    return normalize_web_search_base_url(
        provider_settings.get("websearch_exa_base_url"),
        default=default_base_url,
        provider_name="Exa",
    )
except ValueError as exc:
    logger.warning(
        "Invalid Exa API Base URL configured; falling back to default base URL %s: %s",
        default_base_url,
        exc,
    )
    return default_base_url
```
Force-pushed from faf411f to 0068960
zouyonghe left a comment
Thanks for the work here. Exa support and the ref-handling cleanup are useful, but I don't think this is ready to merge yet.
I found three issues that should be fixed first:
- **Base URL handling is currently too permissive.** `normalize_web_search_base_url()` now accepts endpoint URLs such as `https://api.exa.ai/search`, but the call sites still append endpoint suffixes themselves (`/search`, `/contents`, `/findSimilar`, `/extract`). That can produce broken URLs like `.../search/search` and `.../extract/extract`. This affects both the new Exa path and the Tavily extraction path. (A sketch of a possible guard follows this comment.)
- **`search_type` allows an unsupported Exa value.** `ExaWebSearchTool` currently accepts `instant`, but Exa's official docs list only `auto`, `neural`, `fast`, and `deep` for the search type. If `instant` is passed through, the upstream API will likely reject it.
- **`BochaWebSearchTool` lost its builtin config registration.** In this PR it is registered with plain `@builtin_tool`, while `master` uses `@builtin_tool(config=_BOCHA_WEB_SEARCH_TOOL_CONFIG)`. That regresses builtin config status/tag rendering in the dashboard for Bocha.
CI is green, but the current tests do not cover these regression surfaces. I suggest fixing the above and adding focused tests for:
- rejecting or correctly handling endpoint-level base URLs
- the Exa `search_type` allowlist
- Bocha builtin config metadata registration
After those are addressed, I think this can be re-reviewed quickly.
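A minimal sketch of the endpoint-suffix guard the first point asks for, assuming the `disallowed_path_suffixes` parameter that the follow-up commit describes; the exact signature and defaults in the PR may differ:

```python
from urllib.parse import urlparse


def normalize_web_search_base_url(
    base_url: str | None,
    *,
    default: str,
    provider_name: str,
    disallowed_path_suffixes: tuple[str, ...] = (
        "/search",
        "/contents",
        "/findSimilar",
        "/extract",
    ),
) -> str:
    candidate = (base_url or "").strip() or default
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(
            f"Error: {provider_name} API Base URL must start with http:// or https://. "
            f"Proxy base paths are allowed. Received: '{candidate}'."
        )
    path = parsed.path.rstrip("/")
    for suffix in disallowed_path_suffixes:
        if path.endswith(suffix):
            # Call sites append endpoint paths themselves; accepting an
            # endpoint-level base URL would yield e.g. .../search/search.
            raise ValueError(
                f"Error: {provider_name} API Base URL must not end with '{suffix}'."
            )
    return candidate.rstrip("/")
```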
@zouyonghe On item 2: I re-checked Exa's current official docs before changing the allowlist. As of April 20, 2026, Official links: Both pages currently list
Both pages currently list |
Replying to review #7359 (review), specifically item 2 (`search_type`): I re-checked Exa's current official docs before changing the allowlist. As of April 20, 2026, Official links:
So I'm not removing
… Exa search types
- Reject specific API endpoint paths (e.g., /search, /extract) in base URL normalization via new disallowed_path_suffixes parameter to prevent misconfiguration errors
- Add deep-lite and deep-reasoning to valid Exa search types and normalize search_type input before validation
- Add missing config parameter to BochaWebSearchTool builtin_tool decorator so provider status checks are properly registered
@sourcery-ai review
Hey - I've found 4 issues, and left some high level feedback:
- The allowed Exa `search_type` values and `category` options are hard-coded in both `_EXA_SEARCH_TYPES` and the `ExaWebSearchTool.parameters` description; consider centralizing these in shared constants (and reusing in the schema description) to avoid drift between validation and documentation.
- `normalize_web_search_base_url` can now raise `ValueError` during request construction (e.g., `_get_exa_base_url`, `_get_tavily_base_url`, `URLExtractor.__init__`); if you expect misconfiguration in production, consider catching this closer to configuration load or surfacing a clearer, provider-level error instead of letting it bubble up as a runtime exception.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The allowed Exa `search_type` values and `category` options are hard-coded in both `_EXA_SEARCH_TYPES` and the `ExaWebSearchTool.parameters` description; consider centralizing these in shared constants (and reusing in the schema description) to avoid drift between validation and documentation.
- `normalize_web_search_base_url` can now raise `ValueError` during request construction (e.g., `_get_exa_base_url`, `_get_tavily_base_url`, `URLExtractor.__init__`); if you expect misconfiguration in production, consider catching this closer to configuration load or surfacing a clearer, provider-level error instead of letting it bubble up as a runtime exception.
## Individual Comments
### Comment 1
<location path="tests/unit/test_web_search_utils.py" line_range="96-97" />
<code_context>
+
+
+@pytest.mark.asyncio
+@pytest.mark.parametrize(
+ ("search_type", "expected"),
+ [
</code_context>
<issue_to_address>
**suggestion (testing):** Add coverage for defaulting behavior of `normalize_web_search_base_url` when `base_url` is `None` or empty.
Current tests cover invalid schemes and disallowed suffixes. Please also add a parametrized test that passes `base_url=None` and `base_url=" "` and asserts that `normalize_web_search_base_url` returns the normalized `default` value. This will ensure callers like `_get_tavily_base_url`, `_get_exa_base_url`, and `URLExtractor` can rely on the default when the field is left blank.
```suggestion
@pytest.mark.parametrize(
    ("base_url", "default"),
    [
        (None, "https://exa.ai"),
        (" ", "https://exa.ai"),
    ],
)
def test_normalize_web_search_base_url_defaults_when_blank(
    base_url, default: str
):
    normalized = normalize_web_search_base_url(base_url, default=default)
    # Should return the normalized value of the provided default when base_url is blank/None
    assert normalized == normalize_web_search_base_url(default, default=default)


@pytest.mark.parametrize(
    ("base_url", "expected_message"),
```
</issue_to_address>
### Comment 2
<location path="docs/en/use/websearch.md" line_range="43" />
<code_context>
+
+Go to [Exa](https://dashboard.exa.ai) to get an API Key, then fill it in the corresponding configuration item.
+
If you use Tavily as your web search source, you will get a better experience optimization on AstrBot ChatUI, including citation source display and more:

</code_context>
<issue_to_address>
**suggestion (typo):** "a better experience optimization" reads unnaturally; consider a more idiomatic English phrasing.
Suggest rewording this to something more natural, e.g. "a better optimized experience" or simply "a better experience", as in: "you will get a better (optimized) experience on AstrBot ChatUI".
```suggestion
If you use Tavily as your web search source, you will get a better experience on AstrBot ChatUI, including citation source display and more:
```
</issue_to_address>
### Comment 3
<location path="astrbot/core/tools/web_search_tools.py" line_range="384" />
<code_context>
]
+async def _exa_search(
+ provider_settings: dict,
+ payload: dict,
</code_context>
<issue_to_address>
**issue (complexity):** Consider introducing shared helper functions for Exa HTTP calls and POST+timeout handling to remove duplication while keeping behavior unchanged.
You can reduce the new complexity without changing behavior by introducing two small helpers:
1. **Unify Exa HTTP calls**
The three Exa helpers are almost identical except for path, action, and result mapping. A small internal helper keeps all behavior but removes repetition and makes future changes safer:
```python
async def _exa_request(
    provider_settings: dict,
    path: str,
    payload: dict,
    action: str,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> dict:
    exa_key = await _EXA_KEY_ROTATOR.get(provider_settings)
    url = f"{_get_exa_base_url(provider_settings)}{path}"
    headers = {
        "x-api-key": exa_key,
        "Content-Type": "application/json",
    }
    async with aiohttp.ClientSession(trust_env=True) as session:
        async with session.post(
            url,
            json=payload,
            headers=headers,
            timeout=_normalize_timeout(timeout),
        ) as response:
            if response.status != 200:
                reason = await response.text()
                raise Exception(
                    _format_provider_request_error(
                        "Exa", action, url, reason, response.status
                    )
                )
            return await response.json()
```
Then the three public helpers become focused on payload + mapping:
```python
async def _exa_search(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/search",
        payload,
        action="web search",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]


async def _exa_extract(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[dict]:
    data = await _exa_request(
        provider_settings,
        "/contents",
        payload,
        action="content extraction",
        timeout=timeout,
    )
    return data.get("results", [])


async def _exa_find_similar(
    provider_settings: dict,
    payload: dict,
    timeout: int = MIN_WEB_SEARCH_TIMEOUT,
) -> list[SearchResult]:
    data = await _exa_request(
        provider_settings,
        "/findSimilar",
        payload,
        action="find similar",
        timeout=timeout,
    )
    return [
        SearchResult(
            title=item.get("title", ""),
            url=item.get("url", ""),
            snippet=(item.get("text") or "")[:500],
        )
        for item in data.get("results", [])
    ]
```
2. **Deduplicate timeout plumbing for HTTP calls**
You now repeat `timeout=_normalize_timeout(timeout)` in many places. A very small wrapper keeps the new timeout behavior but centralizes it:
```python
def _post_json(
    session: aiohttp.ClientSession,
    url: str,
    *,
    json: dict,
    headers: dict,
    timeout: int | float | str | None = None,
):
    # Return the (unawaited) request context manager so callers can
    # write `async with _post_json(...) as response:` as shown below.
    return session.post(
        url,
        json=json,
        headers=headers,
        timeout=_normalize_timeout(timeout),
    )
```
Usage example (Tavily/BoCha/Baidu/Exa):
```python
async with aiohttp.ClientSession(trust_env=True) as session:
    async with _post_json(
        session,
        url,
        json=payload,
        headers=header,
        timeout=timeout,
    ) as response:
        ...
```
This keeps all existing behavior (including the minimum timeout and trust_env) but removes a lot of repeated arguments and makes future changes to timeout handling or session usage localized.
</issue_to_address>
### Comment 4
<location path="astrbot/core/utils/web_search_utils.py" line_range="49" />
<code_context>
+ return normalized
+
+
+def _iter_web_search_result_items(
+ accumulated_parts: list[dict[str, Any]],
+):
</code_context>
<issue_to_address>
**issue (complexity):** Consider inlining the small helper functions and simplifying the adapter function to reduce indirection and keep closely related logic together.
You can trim a layer or two without losing clarity or behavior.
### 1. Inline `_iter_web_search_result_items` into `collect_web_search_ref_items`
`_iter_web_search_result_items` is only used once and its logic is straightforward. Inlining it makes the flow easier to follow without losing readability:
```python
def collect_web_search_ref_items(
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> list[dict[str, Any]]:
    web_search_refs: list[dict[str, Any]] = []
    seen_indices: set[str] = set()
    for part in accumulated_parts:
        if part.get("type") != "tool_call" or not part.get("tool_calls"):
            continue
        for tool_call in part["tool_calls"]:
            if (
                tool_call.get("name") not in WEB_SEARCH_REFERENCE_TOOLS
                or not tool_call.get("result")
            ):
                continue
            result = tool_call["result"]
            try:
                result_data = json.loads(result) if isinstance(result, str) else result
            except json.JSONDecodeError:
                continue
            if not isinstance(result_data, dict):
                continue
            for item in result_data.get("results", []):
                if not isinstance(item, dict):
                    continue
                ref_index = item.get("index")
                if not ref_index or ref_index in seen_indices:
                    continue
                payload = {
                    "index": ref_index,
                    "url": item.get("url"),
                    "title": item.get("title"),
                    "snippet": item.get("snippet"),
                }
                if favicon_cache and payload["url"] in favicon_cache:
                    payload["favicon"] = favicon_cache[payload["url"]]
                web_search_refs.append(payload)
                seen_indices.add(ref_index)
    return web_search_refs
```
That removes one indirection while keeping the logic in a single, obvious place.
### 2. Collapse `_extract_ref_indices` into `build_web_search_refs`
The regex + ordering behavior is easier to understand if the extraction and selection live together:
```python
def build_web_search_refs(
    accumulated_text: str,
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> dict:
    ordered_refs = collect_web_search_ref_items(accumulated_parts, favicon_cache)
    if not ordered_refs:
        return {}
    refs_by_index = {ref["index"]: ref for ref in ordered_refs}
    # Inline _extract_ref_indices
    ref_indices: list[str] = []
    seen_indices: set[str] = set()
    for match in re.finditer(r"<ref>(.*?)</ref>", accumulated_text):
        ref_index = match.group(1).strip()
        if not ref_index or ref_index in seen_indices:
            continue
        ref_indices.append(ref_index)
        seen_indices.add(ref_index)
    used_refs = [refs_by_index[idx] for idx in ref_indices if idx in refs_by_index]
    if not used_refs:
        used_refs = ordered_refs
    return {"used": used_refs}
```
This keeps the “how indices are extracted” and the “how refs are ordered/fallback” logic co-located.
### 3. Simplify `collect_web_search_results` (or consider removing it)
If you still need `collect_web_search_results`, it can be a very thin adapter, making the relationship to `collect_web_search_ref_items` obvious:
```python
def collect_web_search_results(accumulated_parts: list[dict[str, Any]]) -> dict:
    return {
        ref["index"]: {
            "url": ref.get("url"),
            "title": ref.get("title"),
            "snippet": ref.get("snippet"),
        }
        for ref in collect_web_search_ref_items(accumulated_parts)
    }
```
If the only caller is effectively doing this transformation, you could alternatively have that caller perform the mapping itself and drop `collect_web_search_results` entirely to reduce the public API surface.
</issue_to_address>
```python
@pytest.mark.parametrize(
    ("base_url", "expected_message"),
```
suggestion (testing): Add coverage for defaulting behavior of normalize_web_search_base_url when base_url is None or empty.
Current tests cover invalid schemes and disallowed suffixes. Please also add a parametrized test that passes base_url=None and base_url=" " and asserts that normalize_web_search_base_url returns the normalized default value. This will ensure callers like _get_tavily_base_url, _get_exa_base_url, and URLExtractor can rely on the default when the field is left blank.
Suggested change:
```python
@pytest.mark.parametrize(
    ("base_url", "default"),
    [
        (None, "https://exa.ai"),
        (" ", "https://exa.ai"),
    ],
)
def test_normalize_web_search_base_url_defaults_when_blank(
    base_url, default: str
):
    normalized = normalize_web_search_base_url(base_url, default=default)
    # Should return the normalized value of the provided default when base_url is blank/None
    assert normalized == normalize_web_search_base_url(default, default=default)


@pytest.mark.parametrize(
    ("base_url", "expected_message"),
```
```markdown
Go to [Exa](https://dashboard.exa.ai) to get an API Key, then fill it in the corresponding configuration item.

If you use Tavily as your web search source, you will get a better experience optimization on AstrBot ChatUI, including citation source display and more:
```
suggestion (typo): "a better experience optimization" reads unnaturally; consider a more idiomatic English phrasing, e.g. "a better optimized experience" or simply "a better experience", as in: "you will get a better (optimized) experience on AstrBot ChatUI".
Suggested change:
```markdown
If you use Tavily as your web search source, you will get a better experience on AstrBot ChatUI, including citation source display and more:
```
```python
    return normalized


def _iter_web_search_result_items(
```
issue (complexity): Consider inlining the small helper functions and simplifying the adapter function to reduce indirection and keep closely related logic together.
You can trim a layer or two without losing clarity or behavior.
1. Inline `_iter_web_search_result_items` into `collect_web_search_ref_items`
`_iter_web_search_result_items` is only used once and its logic is straightforward. Inlining it makes the flow easier to follow without losing readability:
```python
def collect_web_search_ref_items(
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> list[dict[str, Any]]:
    web_search_refs: list[dict[str, Any]] = []
    seen_indices: set[str] = set()
    for part in accumulated_parts:
        if part.get("type") != "tool_call" or not part.get("tool_calls"):
            continue
        for tool_call in part["tool_calls"]:
            if (
                tool_call.get("name") not in WEB_SEARCH_REFERENCE_TOOLS
                or not tool_call.get("result")
            ):
                continue
            result = tool_call["result"]
            try:
                result_data = json.loads(result) if isinstance(result, str) else result
            except json.JSONDecodeError:
                continue
            if not isinstance(result_data, dict):
                continue
            for item in result_data.get("results", []):
                if not isinstance(item, dict):
                    continue
                ref_index = item.get("index")
                if not ref_index or ref_index in seen_indices:
                    continue
                payload = {
                    "index": ref_index,
                    "url": item.get("url"),
                    "title": item.get("title"),
                    "snippet": item.get("snippet"),
                }
                if favicon_cache and payload["url"] in favicon_cache:
                    payload["favicon"] = favicon_cache[payload["url"]]
                web_search_refs.append(payload)
                seen_indices.add(ref_index)
    return web_search_refs
```
That removes one indirection while keeping the logic in a single, obvious place.
2. Collapse `_extract_ref_indices` into `build_web_search_refs`
The regex + ordering behavior is easier to understand if the extraction and selection live together:
```python
def build_web_search_refs(
    accumulated_text: str,
    accumulated_parts: list[dict[str, Any]],
    favicon_cache: dict[str, str] | None = None,
) -> dict:
    ordered_refs = collect_web_search_ref_items(accumulated_parts, favicon_cache)
    if not ordered_refs:
        return {}
    refs_by_index = {ref["index"]: ref for ref in ordered_refs}
    # Inline _extract_ref_indices
    ref_indices: list[str] = []
    seen_indices: set[str] = set()
    for match in re.finditer(r"<ref>(.*?)</ref>", accumulated_text):
        ref_index = match.group(1).strip()
        if not ref_index or ref_index in seen_indices:
            continue
        ref_indices.append(ref_index)
        seen_indices.add(ref_index)
    used_refs = [refs_by_index[idx] for idx in ref_indices if idx in refs_by_index]
    if not used_refs:
        used_refs = ordered_refs
    return {"used": used_refs}
```
This keeps the "how indices are extracted" and the "how refs are ordered/fallback" logic co-located.
3. Simplify `collect_web_search_results` (or consider removing it)
If you still need `collect_web_search_results`, it can be a very thin adapter, making the relationship to `collect_web_search_ref_items` obvious:
```python
def collect_web_search_results(accumulated_parts: list[dict[str, Any]]) -> dict:
    return {
        ref["index"]: {
            "url": ref.get("url"),
            "title": ref.get("title"),
            "snippet": ref.get("snippet"),
        }
        for ref in collect_web_search_ref_items(accumulated_parts)
    }
```
If the only caller is effectively doing this transformation, you could alternatively have that caller perform the mapping itself and drop `collect_web_search_results` entirely to reduce the public API surface.
Quick re-check on the latest head:
Please fix the
surface Exa content extraction status errors with URL and error tag details; extract count validation into reusable _normalize_count helper; pass llm_checkpoint_id through build_message_saved_event parameter
live_chat checkpoint regression: `build_message_saved_event()` now takes `llm_checkpoint_id`. Exa `max_results`: both `ExaWebSearchTool` and `ExaFindSimilarTool` now call `_normalize_count()` instead of a bare `int()`, so invalid values (e.g. …) are handled gracefully. But I noticed that `BraveSearchTool` still uses a bare `int()`; should that be changed too, or left as is?
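The `_normalize_count()` helper referenced above presumably looks something like the following; this is a sketch, since the PR's exact implementation is not quoted in this thread:

```python
def _normalize_count(
    value: object, default: int = 10, minimum: int = 1, maximum: int = 100
) -> int:
    """Coerce a model-supplied count to an int, falling back and clamping."""
    try:
        count = int(value)  # int("10") works; int("foo") and int(None) raise
    except (TypeError, ValueError):
        count = default
    return max(minimum, min(count, maximum))
```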
Re-checked the latest head and the current review threads.
From my side, this looks merge-ready now. The current
Motivation
Changes
Exa search provider: three new `@llm_tool` tools
- `web_search_exa`: semantic search, supporting 5 search types (auto/neural/fast/instant/deep) and 6 verticals (company/people/research paper/news/personal site/financial report)
- `exa_extract_web_page`: extracts full page text via the `/contents` endpoint
- `exa_find_similar`: finds semantically similar pages via the `/findSimilar` endpoint

Configurable API Base URLs: the Tavily and Exa Base URLs can be customized in the WebUI; the change covers the `web_searcher`, `url_parser`, and `kb_helper` paths

Optional timeout configuration: AstrBot's built-in web search tools accept an optional `timeout` parameter, defaulting to 30 seconds

Config metadata i18n: `default.py` adds the new config items and conditional-rendering metadata, with the `en-US`/`ru-RU`/`zh-CN` locales updated in sync

Tool management and shared-capability cleanup: `astr_agent_hooks.py` adds `<ref>index</ref>` citation hints for search tools in WebChat, helping the model emit traceable source markers

Citation pipeline completion: `chat.py`/`live_chat.py` share the web search `<ref>` extraction logic; `MessageList.vue`'s `<ref>` parsing recognizes Exa / BoCha / `exa_find_similar` instead of only `web_search_tavily`; messages without `<ref>` tags still fall back to showing the source list

Tests: new `tests/unit/test_web_search_utils.py` covering web search result mapping, favicon pass-through, explicit `<ref>` hits, and no-`<ref>` fallback

Docs: the ZH/EN `websearch.md` add tool and parameter descriptions for `default`/Tavily/Baidu AI Search/BoCha/Exa

This is not a breaking change
Screenshots or test results
Local verification commands:
Checklist
Summary by Sourcery
Add Exa as a new configurable web search provider, unify web search tooling and reference handling, and enhance configurability and robustness of web search integrations.
New Features:
Enhancements:
Documentation:
Tests: